# Introduction to Intel Xeon Phi Workshop

Computational Science Team @ NeSI

Jordi Blasco (jordi.blasco@nesi.org.nz)



#### Outline

- Intel Xeon Phi overview
  - Why this enthusiasm about Intel Phi?
  - Hardware specs
  - Roadmap
- Parallel Strategies
  - How easy is it?
  - Perfect Candidates
  - How to get the maximum performance on Intel Phi

### Why this enthusiasm about Intel Phi?

- You don't need to learn a new programming language (CUDA, ...).
- You don't need to change the code in order to run on MIC.
- But...
- The Intel MIC cores are slow compared with those of current Xeon processors.
- To get real performance, you need to make some changes to the code.
- It is not easy, but a medium-sized code can be modified in a few hours or days.
- Compared with other architectures, it's child's play.



Figure : source www.intel.com

#### Systems based on coprocessors in TOP500 – June 2013



Figure : source Top500 http://www.top500.org/statistics/list/



#### Systems based on coprocessors in TOP500 – June 2014



Figure : source Top500 http://www.top500.org/statistics/list/



## Hardware specs

### Hardware specs of 5110P (KNC)



- 60 cores / 1.053 GHz / 240 threads (4 hardware threads per core).
- 30 MB cache.
- 8 GB memory and 320 GB/s bandwidth.
- GDDR5 x16 channels (5.5 Gbit each).
- Memory access latency of about 300 ns!
- Linux operating system, IP addressable.
- Built using Intel's 22 nm process technology.
- 512-bit Single Instruction, Multiple Data (SIMD) instructions.
- 32 vector registers.
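
As a rough worked example of what these numbers imply, the theoretical peak double-precision throughput can be estimated from the core count, the clock frequency and the vector width. This is only a sketch: it assumes each core can issue one 512-bit fused multiply-add per cycle, counted as two floating-point operations per 64-bit lane, which is an assumption not stated on the slide.

$$
P_{\text{peak}} \approx 60\ \text{cores} \times 1.053\ \text{GHz} \times \frac{512\ \text{bit}}{64\ \text{bit}} \times 2 = 1010.88\ \text{GFLOP/s}
$$

Real applications usually reach only a fraction of this figure, which is why the rest of the workshop concentrates on threading and vectorization.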

# Intel Xeon Phi Public Roadmap



Figure: source http://newsroom.intel.com/servlet/JiveServlet/download/38-32805/ISC14\_Raj\_Hazra\_keynote.pdf

# Intel Omni Scale Fabric Roadmap



Figure : source http://newsroom.intel.com/servlet/JiveServlet/download/38-32805/ISC14\_Raj\_Hazra\_keynote.pdf

### Matrix addition processing in scalar and vector mode.

- SSE 128-bit (streaming) SIMD / 4 single-precision elements at once (2008).
- AVX 256-bit SIMD / 8 single-precision elements at once (2011).
- MIC 512-bit SIMD / 16 single-precision elements at once (2012).
- AVX2 256-bit SIMD / up to 32 elements at once with 8-bit integers (Q2 2013).
- AVX-512 512-bit SIMD / 16 single-precision elements at once (Q2 2015).



Figure: source: www.intel.com
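
The difference between scalar and vector processing can be sketched in C as follows. This is a minimal illustration, not code from the workshop: the function names `add_scalar` and `add_simd` are hypothetical, and the `#pragma omp simd` hint assumes a compiler with OpenMP 4.0 SIMD support (other compilers ignore the pragma and the loop still gives the same result).

```c
#include <assert.h>
#include <stddef.h>

/* Scalar-style addition: conceptually one element per loop iteration. */
void add_scalar(const float *a, const float *b, float *c, size_t n)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

/* Same loop with hints that make vectorization easier:
 * - restrict promises the arrays do not overlap,
 * - omp simd asks the compiler to process several elements
 *   per instruction (up to 16 floats with 512-bit registers). */
void add_simd(const float *restrict a, const float *restrict b,
              float *restrict c, size_t n)
{
    assert(a && b && c);
    #pragma omp simd
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

Both functions compute the same result; whether the second one actually uses vector instructions depends on the compiler, its flags and the target architecture, so the compiler's vectorization report is worth checking.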

# Parallel Strategies



### Perfect Candidates


- Serial applications that need to be run many times.
- Massively parallel applications (OpenMP).
- Massively parallel applications (MPI).
- Massively parallel hybrid applications (MPI+OpenMP).
- Applications that can exploit the vector capabilities of MIC.
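
A loop that combines threading with vectorization, two of the properties listed above, might look like the following sketch. The function name `dot` is hypothetical, and the combined `parallel for simd` construct assumes OpenMP 4.0 or later; without OpenMP enabled the pragma is ignored and the loop runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product: iterations are split across threads, and each
 * thread's chunk is a candidate for vector instructions.
 * The reduction clause gives every thread a private partial sum
 * that is combined at the end of the loop. */
double dot(const double *x, const double *y, size_t n)
{
    assert(x && y);
    double sum = 0.0;
    #pragma omp parallel for simd reduction(+:sum)
    for (size_t i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

With 240 hardware threads available on the coprocessor, a loop like this only pays off when `n` is large enough to give every thread a meaningful amount of work.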

# How to get the maximum performance on Intel Phi

### Maximize the performance on the processor first!

- The first piece of advice is to focus on performance on the processor (the host Xeon) before moving to the coprocessor.
- Audit the loops and find the hot-spots.
- Profile the code.
- Profile the MPI collectives.
- Explore the vectorization opportunities.
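
Before reaching for a full profiler, a suspected hot-spot can be timed by hand. The sketch below is only an illustration of that idea: `time_kernel`, `scale_kernel` and `struct scale_args` are hypothetical names, and `clock()` reports CPU time, which in a multithreaded run is accumulated over all threads, so a wall-clock timer is the better choice there.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Arguments for the example kernel below. */
struct scale_args {
    double *data;
    size_t n;
    double factor;
};

/* Example candidate hot-spot: scale an array in place. */
void scale_kernel(void *p)
{
    struct scale_args *args = p;
    for (size_t i = 0; i < args->n; i++) {
        args->data[i] *= args->factor;
    }
}

/* Run `kernel` `repeats` times and return the average CPU time
 * per call in seconds, as measured by the standard clock(). */
double time_kernel(void (*kernel)(void *), void *arg, int repeats)
{
    assert(kernel && repeats > 0);
    clock_t start = clock();
    for (int r = 0; r < repeats; r++) {
        kernel(arg);
    }
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC / repeats;
}
```

Comparing such timings for a few candidate loops gives a first indication of where optimization effort is best spent; a dedicated profiler gives a far more complete picture.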

# Questions & Answers



#### For more info

#### **Books**

- James Jeffers & James Reinders, Intel Xeon Phi Coprocessor High Performance Programming, Newnes, 2013. ISBN: 0124104940
- James Reinders, Parallel Programming and Optimization with Intel® Xeon Phi™ Coprocessors, Colfax, 2013. ISBN-13: 978-0-9885234-1-8

#### Intel website (trainings and workshops)

- http://software.intel.com/en-us/mic-developer
- http://software.intel.com/en-us/intel-mkl
- http://software.intel.com/en-us/intel-composer-xe/